I'm Julia.
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
Goal: know how to use pandas to answer some specific questions about a dataset.
Roadmap:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
# Download and read the data
!wget "http://bit.ly/311-data-tar-gz" -O 311-data.tar.gz
!wget "https://raw2.github.com/jvns/talks/master/pyladiesfeb2014/tiny.csv" -O tiny.csv
!tar -xzf "311-data.tar.gz" # extract the archive we just downloaded
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
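If you'd rather see a dataframe without reading a CSV at all, here's a tiny made-up one built from a plain dict (the column names just mimic tiny.csv — the values are invented):

```python
import pandas as pd

# A dataframe is a table: named columns, one value per row.
people = pd.DataFrame({
    'name': ['ada', 'grace', 'katherine'],
    'age': [36, 45, 33],
    'height': [160, 170, 165],
})
print(people)
print(people.shape)  # (number of rows, number of columns)
```

Each column behaves like a numpy array, which is what makes the operations below fast.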
people = pd.read_csv('tiny.csv')
people
This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 6 rows.
I want you to know about this because you almost always only want a subset of the data you're working on. We are going to look at a CSV with 40 columns and 1,000,000 rows.
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
# How to get a column
small_requests['Complaint Type']
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
# How to get 3 rows
small_requests[:3]
small_requests['Agency Name'][:3]
small_requests[:3]['Agency Name']
small_requests['Complaint Type']
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'
That's numpy in action! Using `==` on a column of a dataframe gives us a Series of True and False values.
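The same trick works on any toy Series, so here's a self-contained sketch of boolean masking with made-up complaint types:

```python
import pandas as pd

s = pd.Series(['Noise', 'Rats', 'Noise', 'Heat'])
mask = s == 'Noise'      # elementwise comparison -> boolean Series
print(mask.tolist())     # [True, False, True, False]
print(s[mask].tolist())  # keeps only the rows where mask is True
```

Indexing with a boolean Series is exactly what the `noise_complaints` line below does, just on a bigger table.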
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Every DataFrame has an index: an integer, a date, or some other label associated with each row.
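To make "index" concrete, here's a tiny sketch (the names are made up): by default the index is just 0, 1, 2, ..., but you can swap any column in as the index and then look rows up by label.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ada', 'grace'], 'age': [36, 45]})
print(df.index.tolist())  # default integer index: [0, 1]

by_name = df.set_index('name')
print(by_name.index.tolist())       # now the names are the index
print(by_name.loc['grace', 'age'])  # look a row up by its index label
```

`set_index` is what we'll use below to turn 'Created Date' into a date index.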
# How to get a specific row
small_requests.ix[0]
# How not to get a row
small_requests[0]
# Your code here
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
noise_complaints[:3]
noise_complaints = noise_complaints.set_index('Created Date')
noise_complaints[:3]
Pandas is awesome for datetime-index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
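Here's a minimal sketch of what that resample is doing, on three made-up timestamps (note: newer pandas spells `resample('H', how=len)` as `resample('h').size()`, which is what this sketch uses):

```python
import pandas as pd

times = pd.to_datetime([
    '2014-02-01 00:10', '2014-02-01 00:50',  # two events in hour 0
    '2014-02-01 02:30',                      # one event in hour 2
])
events = pd.Series(1, index=times)

# Bucket the index into hours and count the events in each bucket.
# Empty hours (like 01:00) show up as 0, which is what makes the plot honest.
per_hour = events.resample('h').size()
print(per_hour)
```

The hourly-noise-complaints plot is this exact operation, just with 100,000 rows.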
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
noise_complaints.resample('H', how=len)[:3]
noise_complaints.resample('H', how=len).plot()
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
orig_data['Complaint Type'].value_counts()
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
# Your code here.
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
# change the index to be actual day names (pandas weekday numbering: 0 = Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
counts_by_weekday.plot(kind='bar')
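A quick sanity check on that weekday numbering, since it's easy to get wrong: pandas counts from 0 = Monday, not Sunday.

```python
import pandas as pd

d = pd.to_datetime('2014-02-03')  # February 3, 2014 was a Monday
print(d.weekday())                # Monday -> 0
```

If your bar chart's labels look shifted by a day, this off-by-one is usually why.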
# Your code here
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
manhattan_streets.value_counts()
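Here's why the `fillna('')` matters, sketched on a toy Series with a missing value: `.str.contains` propagates NaN for missing entries, and NaN can't be used in a boolean mask, so we blank the NaNs out first.

```python
import pandas as pd
import numpy as np

streets = pd.Series(['MANHATTAN AVE', 'BROADWAY', np.nan])

# fillna('') turns the missing value into an empty string,
# so .str.contains returns clean True/False for every row
mask = streets.fillna('').str.contains('MANHATTAN')
print(mask.tolist())  # [True, False, False]
```

(An alternative is `streets.str.contains('MANHATTAN', na=False)`, which treats missing values as non-matches directly.)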
# Our current latitude and longitude
our_lat, our_long = 40.714151, -74.00878
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2
distance_from_us.hist()  # it's already a Series, no need to wrap it
close_complaints = orig_data[distance_from_us < 0.00005]
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')